# Document Image Understanding
## Qwen2.5 VL 72B Instruct FP8 Dynamic
parasail-ai · Apache-2.0 · Image-to-Text · Transformers · English

FP8 quantized version of Qwen2.5-VL-72B-Instruct, supporting vision-text input and text output, optimized and released by Neural Magic.
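Instruct-tuned VLMs like the Qwen2.5-VL variants in this list take interleaved vision-text input. As a minimal sketch, this is the chat-message layout such models commonly expect before the processor's chat template is applied; the file name and question are hypothetical placeholders, not values from this listing:

```python
# Sketch of the interleaved vision-text chat format used by
# Qwen2.5-VL-style instruct models. The image path and question
# below are illustrative placeholders.

def build_vision_chat(image_ref: str, question: str) -> list[dict]:
    """Compose a single-turn conversation whose user content mixes
    an image item and a text item, in that order."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_vision_chat("invoice_page.png", "Extract the invoice total.")
```

A processor's `apply_chat_template` would then turn this structure into model-ready token and pixel inputs.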
## Olmocr 7B 0225 Preview
FriendliAI · Apache-2.0 · Text Recognition · Transformers · English

A document OCR model fine-tuned from Qwen2-VL-7B-Instruct, supporting multilingual document recognition and metadata extraction.
## Qwen2.5 VL 3B Instruct Quantized.w4a16
RedHatAI · Apache-2.0 · Image-to-Text · Transformers · English

Quantized version of Qwen2.5-VL-3B-Instruct with weights quantized to INT4 while activations remain in FP16 (w4a16), designed for efficient vision-text inference.
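As a rough illustration of the w4a16 idea named above — weights stored as 4-bit integers with a scale, activations and the dequantized weights kept in higher-precision float — here is a toy symmetric round-trip in pure Python. The per-tensor scale, the [-8, 7] clamp, and the example values are illustrative assumptions, not the model's actual quantization recipe:

```python
# Toy sketch of symmetric INT4 weight quantization (the "w4" half of
# w4a16). Real schemes typically use per-group or per-channel scales;
# this uses a single per-tensor scale for clarity.

def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Map float weights onto the signed 4-bit range [-8, 7]."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0                              # one step in float units
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights for a float-activation matmul."""
    return [v * scale for v in q]

w = [0.42, -1.3, 0.07, 0.9]
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
# Round-to-nearest keeps each weight within half a quantization step.
```

The point of a16 is that only the weights pay the precision cost; activations flow through the matmul in float, so accuracy loss is bounded by the weight rounding error.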
## Qwen2.5 VL 72B Instruct FP8 Dynamic
RedHatAI · Apache-2.0 · Image-to-Text · Transformers · English

The FP8 quantized version of Qwen2.5-VL-72B-Instruct, supporting vision-text input and text output, suitable for multimodal tasks.
## Eagle2 9B
KnutJaegersberg · Image-to-Text · Transformers · Other

Eagle2 is a high-performance series of vision-language models focused on improving performance through optimized data strategies and training methods. Eagle2-9B is the largest model in the series, striking a good balance between performance and inference speed.
## Eagle2 1B
nvidia · Image-to-Text · Transformers · Other

Eagle2 is a high-performance vision-language model family that emphasizes transparency in data strategies and training recipes, aiming to help the open-source community build competitive vision-language models.
## Paligemma2 10b Ft Docci 448
google · Image-to-Text · Transformers

PaliGemma 2 is a versatile vision-language model (VLM) from Google that combines image and text processing and supports multilingual, multi-task use.
## Florence 2 DocVQA
impactframes · Image-to-Text · Transformers

A version of Microsoft's Florence-2 model fine-tuned for one day on the Docmatix dataset (5% of the data), suitable for image-text understanding tasks.
## Paligemma 3b Ft Docvqa 896
google · Image-to-Text · Transformers

PaliGemma is a lightweight vision-language model from Google, built on the SigLIP vision encoder and the Gemma language model, supporting multilingual image-text understanding and generation.
## Uae License Detection
codedrainer · MIT · Image-to-Text · Transformers

Donut is an OCR-free document understanding Transformer that combines a vision encoder with a text decoder to process document images.
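Donut-based models like the ones below return structured fields as XML-like tags in the decoded sequence rather than as free text. The sketch below, loosely modeled on Donut's token2json post-processing, shows the idea for flat fields only; the tag names and values are made up, and real Donut sequences also support nesting, which is omitted here:

```python
import re

# Simplified sketch of Donut-style output parsing: the decoder emits
# field tags such as <s_total>12.00</s_total>, and a post-processing
# step turns matching open/close pairs into a dict.

TAG = re.compile(r"<s_(\w+)>(.*?)</s_\1>", re.DOTALL)

def parse_donut_output(sequence: str) -> dict:
    """Extract <s_field>value</s_field> pairs into a flat dict."""
    return {name: value.strip() for name, value in TAG.findall(sequence)}

decoded = (
    "<s_company>ACME LTD</s_company>"
    "<s_date>2021-03-04</s_date>"
    "<s_total>12.00</s_total>"
)
fields = parse_donut_output(decoded)
# fields == {"company": "ACME LTD", "date": "2021-03-04", "total": "12.00"}
```

The backreference `\1` in the pattern ensures a value is only captured when its opening and closing tag names agree, so mismatched tags are simply skipped.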
## Donut Base Medical Handwritten Prescriptions Information Extraction Final
Javeria98 · MIT · Image-to-Text · Transformers

A medical handwritten prescription information extraction model based on the Donut architecture, designed to extract structured information from prescription images.
## Thesisdonut
Humayoun · MIT · Image-to-Text · Transformers

A model fine-tuned from naver-clova-ix/donut-base; its specific purpose and capabilities are not documented.
## Donut Base Sroie
enoreyes · MIT · Text Recognition · Transformers

A document understanding model fine-tuned from naver-clova-ix/donut-base, specialized in structured document information extraction.
## Donut Base Bol
prakriti42 · MIT · Text Recognition · Transformers

A document understanding model fine-tuned from naver-clova-ix/donut-base on an image-folder dataset.
## Donut Base Sroie
zahra000 · MIT · Text Recognition · Transformers

A model fine-tuned from naver-clova-ix/donut-base on an image-folder dataset, suitable for document understanding tasks.
## Donut Base Sroie Fine Tuned
adrianccy · MIT · Text Recognition · Transformers

A fine-tuned version of naver-clova-ix/donut-base trained on an image-folder dataset, suitable for document understanding tasks.